A comparison study of IRT calibration methods for mixed-format tests in vertical scaling
Abstract
The purpose of this dissertation was to investigate how different Item Response Theory (IRT)-based calibration methods affect the recovery of student achievement growth patterns. Ninety-six vertical scales (4 × 2 × 2 × 2 × 3) were constructed using different combinations of IRT calibration methods (separate, pair-wise concurrent, semi-concurrent, and concurrent), lengths of common-item set (10 vs. 20 common items), types of common-item set (dichotomous-only vs. mixed-format), and numbers of polytomous items (6 vs. 12) for three simulated datasets differing in the number of examinees sampled per grade (500, 1000, 5000). Three indexes (absolute bias, standard error of equating, and root mean square error) were used to evaluate the performance of the calibration methods on proficiency score distribution recovery over 80 replications. These indexes were derived for seven growth distribution criterion parameters (mean, standard deviation, effect size, and proportions of examinees within four proficiency categories). Although exceptions were found in the results for all criterion parameters, important general trends did emerge. Pair-wise concurrent and semi-concurrent calibration methods performed better than concurrent and separate calibration methods for most criterion parameters and combinations of research conditions. Separate calibration, the vertical scaling method used most often in practice, provided the poorest results in most instances. Accuracy of vertical scaling also typically improved with larger samples of examinees, more common items, mixed item formats in the common-item set, and more polytomous items in the common-item set. General trends and exceptional cases from the various analyses are described in
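As a rough illustration of the design and the three evaluation indexes named in the abstract, the sketch below enumerates the 4 × 2 × 2 × 2 × 3 conditions and computes absolute bias, standard error, and root mean square error for one criterion parameter across replications. The formulas are common textbook definitions and are an assumption here, since the abstract does not state them; the function name `recovery_indexes` and the sample values are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

# Factor levels as listed in the abstract.
calibration_methods = ["separate", "pair-wise concurrent", "semi-concurrent", "concurrent"]
common_item_lengths = [10, 20]
common_item_types = ["dichotomous-only", "mixed-format"]
polytomous_counts = [6, 12]
sample_sizes = [500, 1000, 5000]

# Full crossing of the five factors: 4 x 2 x 2 x 2 x 3 conditions.
conditions = list(product(calibration_methods, common_item_lengths,
                          common_item_types, polytomous_counts, sample_sizes))
n_conditions = len(conditions)


def recovery_indexes(estimates, true_value):
    """Return (absolute bias, standard error, RMSE) for one criterion parameter.

    `estimates` holds one estimate per replication. These definitions are
    assumed for illustration, not taken from the dissertation.
    """
    abs_bias = abs(mean(estimates) - true_value)
    # Standard error taken as the SD of the estimates across replications
    # (population form, so that RMSE^2 = bias^2 + SE^2 holds exactly).
    std_error = pstdev(estimates)
    rmse = sqrt(mean([(e - true_value) ** 2 for e in estimates]))
    return abs_bias, std_error, rmse


# Hypothetical estimates of a grade-level mean over four replications.
bias, se, rmse = recovery_indexes([0.9, 1.1, 1.2, 1.0], true_value=1.0)
```

Under these assumed definitions the squared RMSE decomposes into squared bias plus squared standard error, which is why the three indexes are usually reported together.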
Similar resources
Assessing IRT Model-Data Fit for Mixed Format Tests
This study examined various model combinations and calibration procedures for mixed format tests under different item response theory (IRT) models and calibration methods. Using real data sets that consist of both dichotomous and polytomous items, nine possibly applicable IRT model mixtures and two calibration procedures were compared based on traditional and alternative goodness-of-fit statisti...
Practical Consequences of Item Response Theory Model Misfit in the Context of Test Equating with Mixed-Format Test Data
In item response theory (IRT) models, assessing model-data fit is an essential step in IRT calibration. While no general agreement has ever been reached on the best methods or approaches to use for detecting misfit, perhaps the more important comment based upon the research findings is that rarely does the research evaluate IRT misfit by focusing on the practical consequences of misfit. The stu...
plink: An R Package for Linking Mixed-Format Tests Using IRT-Based Methods
This introduction to the R package plink is a (slightly) modified version of Weeks (2010), published in the Journal of Statistical Software. The R package plink has been developed to facilitate the linking of mixed-format tests for multiple groups under a common item design using unidimensional and multidimensional IRT-based methods. This paper presents the capabilities of the package in the co...
A Comparison of Different Methods That Deal with Construct Shift in Value Added Modeling: Is Vertical Scaling Necessary?
Title of Document: A COMPARISON OF DIFFERENT METHODS THAT DEAL WITH CONSTRUCT SHIFT IN VALUE ADDED MODELING: IS VERTICAL SCALING NECESSARY? Yong Luo, Doctor of Philosophy, 2013 Directed By: Professor Hong Jiao Department of Human Development and Quantitative Methodology Construct shift is a term used to describe the change of tests in the construct they intend to measure. In tests across multip...
A Reexamination of Lord’s Wald Test for Differential Item Functioning Using Item Response Theory and Modern Error Estimation
The detection of differential item functioning (DIF) is an essential step in increasing the validity of a test for all groups. The item response theory (IRT) model comparison approach has been shown to be the most flexible and powerful method for DIF detection; however, it is computationally intensive, requiring many model refittings. The Wald test, originally employed by Lord for DIF detection...